BioPython v1.79
re module v2.2.1
matplotlib v3.4.2
pandas v1.3.3
Python 3.8.12
Selected Jupyter core packages...
IPython : 7.27.0
ipykernel : 6.4.1
ipywidgets : 7.6.5
jupyter_client : 7.0.1
jupyter_core : 4.8.1
jupyter_server : 1.4.1
jupyterlab : 3.1.7
nbclient : 0.5.3
nbconvert : 6.1.0
nbformat : 5.1.3
notebook : 6.4.3
qtconsole : 5.1.1
traitlets : 5.1.0
Each .gb/.gbk file used here contains a single record.
From https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank/:
Depending on the type of GenBank file(s) you are interested in, they will either contain a single record, or multiple records. You can easily determine this by looking at the raw file - each record will start with a LOCUS line, followed by various other header lines, usually a list of features, the sequence data, and ends with a // line (slash slash).
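The check described above can also be done programmatically. The following is a minimal sketch using only the standard library; the function name `count_genbank_records` and the tiny in-memory sample record are illustrative assumptions, not part of the original notebook.

```python
import io


def count_genbank_records(handle):
    """Return (number of LOCUS lines, number of '//' terminator lines)."""
    locus_lines = 0
    terminator_lines = 0
    for line in handle:
        if line.startswith("LOCUS"):
            locus_lines += 1
        elif line.rstrip() == "//":
            terminator_lines += 1
    return locus_lines, terminator_lines


# Hypothetical, heavily shortened record used only for demonstration.
sample = io.StringIO(
    "LOCUS       DEMO1   12 bp    DNA     linear   BCT 01-JAN-2000\n"
    "ORIGIN\n"
    "        1 atgcatgcat gc\n"
    "//\n"
)
locus_count, terminator_count = count_genbank_records(sample)
print(locus_count, terminator_count)
```

For a real file, pass an open file handle instead of the `io.StringIO` sample. A single-record file should give one LOCUS line and one `//` line.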
The locations that BioPython provides follow Python conventions: they are 0-based and end-exclusive.
This way we can slice the sequence string directly with those locations to obtain the sequence of each feature of interest.
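To illustrate the slicing idea without requiring BioPython, here is a small sketch on a plain Python string. The helper `slice_feature`, the toy genome and the coordinates are assumptions made up for this example. In BioPython itself, `feature.extract(record.seq)` should perform the equivalent operation, including the reverse complement for minus-strand features.

```python
# Translation table used to complement DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def slice_feature(seq, start, end, strand=1):
    """Slice seq with 0-based, end-exclusive coordinates.

    For strand == -1 the reverse complement of the slice is returned.
    """
    sub = seq[start:end]
    if strand == -1:
        sub = sub.translate(COMPLEMENT)[::-1]
    return sub


genome = "AAATGCCCTAAGG"  # hypothetical toy sequence

# A GenBank location written as 3..11 (1-based, inclusive) corresponds to
# start=2, end=11 in Python-style coordinates.
forward = slice_feature(genome, 2, 11)
reverse = slice_feature(genome, 2, 11, strand=-1)
print(forward)  # ATGCCCTAA
print(reverse)  # TTAGGGCAT
```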
The DDBJ/ENA/GenBank Feature Table Definition: documentation of the features used in GenBank files. It is a very good document and worth reading through at least once.
Source:
Locus tag: Locus_tags are identifiers that are systematically applied to every gene in a genome. These tags have become surrogate gene names by the biological community. If two submitters of two different genomes use the same systematic names to describe two very different genes in two very different genomes, it can be very confusing. In order to prevent this from happening INSD has created a registry of locus_tag prefixes. Submitters of eukaryotic and prokaryotic genomes should register their prefix prior to submitting their genome. All components of a project (such as multiple chromosomes or plasmids, etc) should use the same locus_tag prefix.
Source:
Useful videos for the analysis done later:
Both videos are on YouTube @Bioinformatics Coach
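KeyError resolution: indexing the qualifiers dictionary directly raises a KeyError for any feature that has no 'gene' qualifier:
gene_name = gene.qualifiers['gene'][0]
Using .get() with a default list avoids the error:
gene_name = gene.qualifiers.get('gene', ['unavailable'])[0]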
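The difference between the two lines can be seen in a small sketch. It uses plain dictionaries that mimic the shape of a feature's qualifiers (each key maps to a list of strings). The qualifier values and the helper name `gene_name_of` are made up for illustration.

```python
# Hypothetical qualifiers, shaped like BioPython's: key -> list of strings.
qualifiers_with_gene = {"gene": ["dnaA"], "locus_tag": ["DEMO_0001"]}
qualifiers_without_gene = {"locus_tag": ["DEMO_0002"]}


def gene_name_of(qualifiers):
    """Return the first 'gene' value, or 'unavailable' if the key is missing."""
    return qualifiers.get("gene", ["unavailable"])[0]


print(gene_name_of(qualifiers_with_gene))     # dnaA
print(gene_name_of(qualifiers_without_gene))  # unavailable

# Direct indexing fails for the feature without a 'gene' qualifier.
try:
    qualifiers_without_gene["gene"][0]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

The default must be a list (`['unavailable']`), not a bare string, because the trailing `[0]` would otherwise return only the first character of the string.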
NOTE: Our file for Staphylococcus aureus (ATCC® 43300™) (https://genomes.atcc.org/genomes/79691302ed634fef) had only CDS features. So a code block was added to handle any such file that contains only CDS features instead of the usual combination of gene and CDS features.
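The notebook's actual code block for this case is not shown here. One possible way to handle it is to prefer gene features and fall back to CDS features when no gene features exist. The sketch below uses a `namedtuple` stand-in for a feature; `Feature` and `select_features` are hypothetical names for illustration only.

```python
from collections import namedtuple

# Stand-in for a sequence feature: only the attributes needed here.
Feature = namedtuple("Feature", ["type", "qualifiers"])


def select_features(features):
    """Return gene features if any exist, otherwise fall back to CDS features."""
    genes = [f for f in features if f.type == "gene"]
    if genes:
        return genes
    return [f for f in features if f.type == "CDS"]


# Hypothetical feature lists: one "usual" file and one CDS-only file.
usual_file = [
    Feature("source", {}),
    Feature("gene", {"locus_tag": ["DEMO_0001"]}),
    Feature("CDS", {"locus_tag": ["DEMO_0001"]}),
]
cds_only_file = [
    Feature("source", {}),
    Feature("CDS", {"locus_tag": ["DEMO_0002"]}),
]

print([f.type for f in select_features(usual_file)])     # ['gene']
print([f.type for f in select_features(cds_only_file)])  # ['CDS']
```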
How to iterate over a given directory: https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory
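As a minimal sketch of one approach from that discussion, `pathlib` can list the GenBank files in a directory. The helper name `genbank_files` and the temporary demo directory are assumptions for this example; the original notebook may have used a different method such as `os.listdir`.

```python
import tempfile
from pathlib import Path


def genbank_files(directory):
    """Return the .gb/.gbk files directly inside directory, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in {".gb", ".gbk"}
    )


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("demo_b.gbk", "demo_a.gb", "notes.txt"):
        (Path(demo_dir) / name).write_text("")
    found = [p.name for p in genbank_files(demo_dir)]

print(found)  # ['demo_a.gb', 'demo_b.gbk']
```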
gram_positive_Enterococcus_faecium.gb Name: NZ_CP038996, Features count: 5490
gram_negative_Pseudomonas_aeruginosa.gb Name: NC_002516, Features count: 11908
gram_positive_Streptococcus_pneumoniae.gb Name: NZ_CP020549, Features count: 4328
gram_negative_Helicobacter_pylori.gb Name: CP071982, Features count: 3052
gram_positive_Clostridium_botulinum.gb Name: NC_009495, Features count: 7379
gram_positive_Staphylococcus_epidermidis.gb Name: NZ_CP035288, Features count: 4671
gram_positive_Cutibacterium_acnes.gb Name: NC_021085, Features count: 4889
gram_positive_Corynebacterium_diphtheriae.gb Name: NZ_CP025209, Features count: 4565
gram_negative_Neisseria_gonorrhoeae.gb Name: NZ_AP023069, Features count: 4533
gram_positive_Bacillus_subtilis.gb Name: NC_000964, Features count: 9074
gram_negative_Escherichia_coli_BW25113.gb Name: CP009273, Features count: 9462
gram_positive_Listeria_monocytogenes.gb Name: NC_003210, Features count: 9849
gram_negative_Campylobacter_jejuni.gb Name: NC_002163, Features count: 6016
gram_positive_Clostridium_perfringens.gb Name: NZ_CP075979, Features count: 5954
gram_negative_Acinetobacter_baumannii.gb Name: NZ_CP043953, Features count: 7445
gram_negative_Chlamydia_trachomatis.gb Name: NC_000117, Features count: 1869
gram_negative_Klebsiella_pneumoniae.gb Name: NC_016845, Features count: 10894
gram_positive_Staphylococcus_aureus_ATCC_43300_chromosome.gbk Name: 1, Features count: 2734
gram_positive_Staphylococcus_haemolyticus.gb Name: NZ_CP013911, Features count: 5041
gram_negative_Serratia_marcescens.gb Name: NZ_CP027798, Features count: 9973
gram_positive_Bacillus_anthracis.gb Name: NC_007530, Features count: 11041
gram_negative_Salmonella_enterica.gb Name: NC_003197, Features count: 14045
{1: -200, 2: -199, 3: -198, 4: -197, 5: -196, 6: -195, 7: -194, 8: -193, 9: -192, 10: -191, 11: -190, 12: -189, 13: -188, 14: -187, 15: -186, 16: -185, 17: -184, 18: -183, 19: -182, 20: -181, 21: -180, 22: -179, 23: -178, 24: -177, 25: -176, 26: -175, 27: -174, 28: -173, 29: -172, 30: -171, 31: -170, 32: -169, 33: -168, 34: -167, 35: -166, 36: -165, 37: -164, 38: -163, 39: -162, 40: -161, 41: -160, 42: -159, 43: -158, 44: -157, 45: -156, 46: -155, 47: -154, 48: -153, 49: -152, 50: -151, 51: -150, 52: -149, 53: -148, 54: -147, 55: -146, 56: -145, 57: -144, 58: -143, 59: -142, 60: -141, 61: -140, 62: -139, 63: -138, 64: -137, 65: -136, 66: -135, 67: -134, 68: -133, 69: -132, 70: -131, 71: -130, 72: -129, 73: -128, 74: -127, 75: -126, 76: -125, 77: -124, 78: -123, 79: -122, 80: -121, 81: -120, 82: -119, 83: -118, 84: -117, 85: -116, 86: -115, 87: -114, 88: -113, 89: -112, 90: -111, 91: -110, 92: -109, 93: -108, 94: -107, 95: -106, 96: -105, 97: -104, 98: -103, 99: -102, 100: -101, 101: -100, 102: -99, 103: -98, 104: -97, 105: -96, 106: -95, 107: -94, 108: -93, 109: -92, 110: -91, 111: -90, 112: -89, 113: -88, 114: -87, 115: -86, 116: -85, 117: -84, 118: -83, 119: -82, 120: -81, 121: -80, 122: -79, 123: -78, 124: -77, 125: -76, 126: -75, 127: -74, 128: -73, 129: -72, 130: -71, 131: -70, 132: -69, 133: -68, 134: -67, 135: -66, 136: -65, 137: -64, 138: -63, 139: -62, 140: -61, 141: -60, 142: -59, 143: -58, 144: -57, 145: -56, 146: -55, 147: -54, 148: -53, 149: -52, 150: -51, 151: -50, 152: -49, 153: -48, 154: -47, 155: -46, 156: -45, 157: -44, 158: -43, 159: -42, 160: -41, 161: -40, 162: -39, 163: -38, 164: -37, 165: -36, 166: -35, 167: -34, 168: -33, 169: -32, 170: -31, 171: -30, 172: -29, 173: -28, 174: -27, 175: -26, 176: -25, 177: -24, 178: -23, 179: -22, 180: -21, 181: -20, 182: -19, 183: -18, 184: -17, 185: -16, 186: -15, 187: -14, 188: -13, 189: -12, 190: -11, 191: -10, 192: -9, 193: -8, 194: -7, 195: -6, 196: -5, 197: -4, 198: -3, 199: -2, 200: -1}
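A dictionary of this shape can be built with a single comprehension. The sketch below reproduces the printed mapping. The name `position_to_offset` is hypothetical, and reading the values as offsets relative to a feature start is an assumption; the notebook does not say what the mapping is used for.

```python
WINDOW = 200  # number of positions in the printed mapping

# Keys run 1..200; values run -200..-1 (key - 201).
position_to_offset = {i: i - (WINDOW + 1) for i in range(1, WINDOW + 1)}

print(position_to_offset[1])    # -200
print(position_to_offset[200])  # -1
print(len(position_to_offset))  # 200
```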
How to iterate rows of a dataframe: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
GBdata_gram_negative_Pseudomonas_aeruginosa.csv 5697
GBdata_gram_negative_Helicobacter_pylori.csv 1525
GBdata_gram_positive_Cutibacterium_acnes.csv 2435
GBdata_gram_negative_Neisseria_gonorrhoeae.csv 2263
GBdata_gram_positive_Staphylococcus_epidermidis.csv 2325
GBdata_gram_positive_Clostridium_botulinum.csv 3667
GBdata_gram_negative_Serratia_marcescens.csv 4979
GBdata_gram_negative_Salmonella_enterica.csv 4605
GBdata_gram_positive_Staphylococcus_haemolyticus.csv 2509
GBdata_gram_negative_Chlamydia_trachomatis.csv 935
GBdata_gram_positive_Bacillus_anthracis.csv 5479
GBdata_gram_negative_Campylobacter_jejuni.csv 1668
GBdata_gram_positive_Staphylococcus_aureus_ATCC_43300_chromosome.csv 2656
GBdata_gram_negative_Escherichia_coli_BW25113.csv 4490
GBdata_gram_positive_Streptococcus_pneumoniae.csv 2157
GBdata_gram_positive_Bacillus_subtilis.csv 4536
GBdata_gram_positive_Corynebacterium_diphtheriae.csv 2279
GBdata_gram_positive_Clostridium_perfringens.csv 2954
GBdata_gram_negative_Klebsiella_pneumoniae.csv 5404
GBdata_gram_positive_Enterococcus_faecium.csv 2735
GBdata_gram_negative_Acinetobacter_baumannii.csv 3720
GBdata_gram_positive_Listeria_monocytogenes.csv 3055
/var/folders/qw/9wxdq28x5wgcw76z273w4lp00000gn/T/ipykernel_38457/4227904811.py:7: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). plt.figure(figsize=(15, 5), facecolor=(1, 1, 1))